So far...


= We've studied how to perform string matching efficiently and effectively.
= We've seen how string matching is important in data integration.
= Now, we'll see how string matching is important in data cleaning.


Example (1)


First table:

Name          | SSN          | Addr
Jack Lemmon   | 430-871-8294 | Maple St
Harrison Ford | 292-918-2913 | Culver Blvd
Tom Hanks     | 234-762-1234 | Main St

Table S:

Name          | SSN          | Addr
Ton Hanks     | 234-162-1234 | Main Street
Kevin Spacey  | -            | Frost Blvd
Jack Lemon    | 430-817-8294 | Maple Street

= Find records from different datasets that could be the same entity
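
As a rough illustration of how string matching finds such candidate pairs, the sketch below compares the Name values of the two tables with a normalized edit-distance similarity. The function names, the threshold of 0.8 and the choice to compare names only are assumptions made for this example, not part of the slides.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance: minimum number of insertions, deletions, substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or keep
        prev = curr
    return prev[-1]


def similarity(a: str, b: str) -> float:
    """Edit distance normalized to [0, 1]; 1.0 means identical strings."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


# Name columns of the two example tables
first_table = ["Jack Lemmon", "Harrison Ford", "Tom Hanks"]
table_s = ["Ton Hanks", "Kevin Spacey", "Jack Lemon"]

THRESHOLD = 0.8  # hypothetical cut-off, to be tuned on real data

candidate_pairs = [
    (r, s)
    for r in first_table
    for s in table_s
    if similarity(r.lower(), s.lower()) >= THRESHOLD
]
print(candidate_pairs)
```

In practice the SSN and Addr fields would be compared as well, and the per-field scores combined into one record-level decision.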


Example (2) 


[image] and [image] are the same object?


Example (3)


P. Bernstein, D. Chiu: Using Semi-Joins to Solve 
Relational Queries. JACM 28(1): 25-40(1981) 


Philip A. Bernstein, Dah-Ming W. Chiu, Using 
Semi-Joins to Solve Relational Queries, Journal 
of the ACM (JACM), v.28 n.1, p.25-40, Jan. 1981 


= These two bibliographic references concern the 
same publication! 
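
For long strings such as bibliographic references, a token-based measure is often more forgiving than a character-based one. The following sketch computes the Jaccard similarity of the two references above; the tokenization rule (lowercase alphanumeric runs) is an assumption of this example.

```python
import re


def tokens(text: str) -> set:
    """Lowercase the text and split it into alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a: str, b: str) -> float:
    """Size of the token intersection divided by the size of the token union."""
    ta, tb = tokens(a), tokens(b)
    union = ta | tb
    return len(ta & tb) / len(union) if union else 1.0


ref1 = ("P. Bernstein, D. Chiu: Using Semi-Joins to Solve "
        "Relational Queries. JACM 28(1): 25-40(1981)")
ref2 = ("Philip A. Bernstein, Dah-Ming W. Chiu, Using Semi-Joins to Solve "
        "Relational Queries, Journal of the ACM (JACM), v.28 n.1, p.25-40, Jan. 1981")

score = jaccard(ref1, ref2)
print(round(score, 3))
```

The two references share 16 of their 29 distinct tokens, so a threshold around 0.5 would flag them as a candidate match; the right threshold depends on the data.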


The three examples refer to the same problem, which is known under different names:

   o approximate duplicate detection
   o record linkage
   o entity resolution
   o merge-purge
   o data matching ...

It is one of the data quality problems addressed by data cleaning


Outline 


= Introduction to data cleaning 
= Application contexts of data cleaning 


= Data quality dimensions 


= Taxonomy of data quality problems 


= Data quality process 
= Main data quality tools 
= Real-world examples 


Why Data Cleaning? 


Data in the real world is dirty:
= incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data
   o e.g., occupation=""
= noisy: containing errors (spelling, phonetic and typing errors, word transpositions, multiple values in a single free-form field) or outliers
   o e.g., Salary="-10"
= inconsistent: containing discrepancies in codes or names (synonyms and nicknames, prefix and suffix variations, abbreviations, truncation and initials)
   o e.g., Age="42", Birthday="03/07/1997"
   o e.g., was rating "1, 2, 3", now rating "A, B, C"
   o e.g., discrepancy between approximate duplicate records


Data Quality Problems (Dirty Data) 


[Figure: example tables annotated with data quality problems: referential integrity, representation, contradictions, missing values, duplicates, incorrect values]


Impact of Data Quality Problems 


= Incorrect prices in inventory retail databases [English 1999]
   o Costs for consumers: 2.5 billion $
   o 80% of barcode-scan errors are to the disadvantage of the consumer
= IRS 1992: almost 100,000 tax refunds not deliverable [English 1999]
= 50% to 80% of computerized criminal records in the U.S. were found to be inaccurate, incomplete, or ambiguous [Strong et al. 1997a]
= US Postal Service: of 100,000 mass- [text missing] incorrect addresses [Pierce 2004]

[Newspaper clipping: "IRS might be after you - to mail you a check. Incorrect addresses stall nearly 1,500 Tennessee refunds", by Bonna de la Cruz]


Why Is Data Dirty? 


= Incomplete data comes from:
   o data values not available when collected
   o different criteria between the time when the data was collected and when it is analyzed
   o human/hardware/software problems
= Noisy data comes from:
   o data collection: faulty instruments
   o data entry: human or computer errors
   o data transmission
= Inconsistent (and duplicate) data comes from:
   o different data sources, hence non-uniform naming conventions/data codes
   o functional dependency and/or referential integrity violations


Application contexts 


= Integrate data from different sources
   o E.g., populating a DW from different operational data stores or a mediator-based architecture
= Eliminate errors and duplicates within a single source
   o E.g., duplicates in a file of customers
= Migrate data from a source schema into a different fixed target schema
   o E.g., discontinued application packages
= Convert poorly structured data into structured data
   o E.g., processing data collected from the Web


When materializing the integrated data (data warehousing)...


[Figure: SOURCE DATA -> Data Extraction -> Data Transformation -> Data Loading -> TARGET DATA]

ETL: Extraction, Transformation and Loading

70% of the time in a data warehousing project is spent on the ETL process


Why is Data Cleaning Important? 


Data cleaning is the activity of converting source data into target data without errors, duplicates, and inconsistencies, i.e., cleaning and transforming to get high-quality data!

= No quality data, no quality decisions!
   o Quality decisions must be based on good-quality data (e.g., duplicate or missing data may cause incorrect or even misleading statistics)


Quality 


"Even though quality cannot be defined, you know what it is."
Robert M. Pirsig, Zen and the Art of Motorcycle Maintenance


Outline 


= Introduction to data cleaning 

= Application contexts of data cleaning 
> Data quality dimensions 

= Taxonomy of data quality problems 
= Data quality process 

= Main data quality tools 

= Real-world examples 


What is Data of Good Quality? 


= Fitness for use
= 15 dimensions: Accuracy, Objectivity, Believability, Reputation, Accessibility, Security, Relevance, Value-Added, Timeliness, Completeness, Amount of Data, Interpretability, Understandability, Consistency, Concise Representation
= 179 dimensions


Category IQ Criteria | TDQM MBIS Weikum DWQ SCOUG Chen 
Content- Accuracy Yes Yes Yes Yes Yes Yes 
related Documentation Yes 
Criteria Relevancy Yes Yes Yes Yes 
Value-Added Yes Yes 
Completeness Yes Yes Yes Yes Yes Yes 
Interpretability Yes Yes 
Technical Timeliness Yes Yes Yes Yes Yes Yes 
Criteria Reliability Yes 
Latency Yes Yes 
Performability Yes Yes 
Response time Yes Yes Yes 
Security Yes Yes Yes 
Accessibility Yes Yes Yes Yes Yes 
Price Yes Yes Yes 
Customer Support Yes 
Intellectual Believability Yes Yes Yes Yes Yes 
Criteria Reputation Yes Yes Yes 
Objectivity Yes 
Instantiation Verifiability Yes 
related Amount of data Yes Yes Yes 
Criteria Understandability Yes Yes 
Concise represent. Yes 
Consistent represent. Yes Yes Yes Yes Yes 


Data Quality Dimensions (classical) 


Accuracy
   o Refers to the closeness of values in a database to the true values of the entities that the data in the database represent; if it is not 100%, there are errors in the data
   o Example: "Jhn" vs. "John"

Completeness
   o Concerns whether the database has complete information to answer queries
   o Partial knowledge of the records in a table or of the attributes in a record

Currency
   o Aims at identifying the current values of entities represented by tuples in a database and at answering queries using those values
   o Example: Residence (Permanent) Address: outdated vs. up to date

Consistency
   o Refers to the validity and integrity of data representing real-world entities; if it is violated, discrepancies and conflicts arise in the data


   o Example: ZIP Code and City inconsistent


Accuracy


= Closeness between a value v and a value v', considered as the correct representation of the real-world phenomenon that v aims to represent.
   o Ex: for a person name "John", v' = John is correct, v = Jhn is incorrect
= Syntactic accuracy: closeness of a value v to the elements of the corresponding definition domain D
   o Ex: if v = Jack, even if v' = John, v is considered syntactically correct, because it is an admissible value in the domain of people names.
   o Measured by means of comparison functions (e.g., edit distance) that evaluate the distance between v and the values of the domain
= Semantic accuracy: closeness of the value v to the true value v'
   o Measured with a <yes, no> or <correct, not correct> domain
   o Coincides with correctness
   o The corresponding true value has to be known
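
The difference between the two notions can be made concrete with a small sketch. Syntactic accuracy is taken here as the edit distance from v to the closest value of the domain D, and semantic accuracy as a yes/no comparison with the known true value v'. The domain contents and function names are hypothetical.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1,
                            prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]


# Hypothetical definition domain D of admissible person names
DOMAIN = {"John", "Jack", "Mary"}


def syntactic_distance(v: str, domain: set) -> int:
    """Distance from v to the closest domain value; 0 means syntactically correct."""
    return min(edit_distance(v, d) for d in domain)


def semantically_accurate(v: str, true_value: str) -> bool:
    """Yes/no check; only possible when the true value v' is known."""
    return v == true_value


print(syntactic_distance("Jhn", DOMAIN))       # not in D, one edit away from "John"
print(syntactic_distance("Jack", DOMAIN))      # admissible value of D
print(semantically_accurate("Jack", "John"))   # admissible, yet not the true value
```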


Granularity of accuracy definition


= Accuracy may refer to:
   o a single value of a relation attribute
   o an attribute or column
   o a relation
   o the whole database


Metrics for quantifying accuracy


= Weak accuracy error
   o Characterizes accuracy errors that do not affect the identification of tuples
= Strong accuracy error
   o Characterizes accuracy errors that affect the identification of tuples
= Percentage of accurate tuples
   o Characterizes the fraction of accurate tuples matched with a reference table


Completeness 


= "The extent to which data are of sufficient breadth, depth, and scope for the task in hand."
= Three types:
   o Schema completeness: degree to which concepts and their properties are not missing from the schema
   o Column completeness: evaluates the missing values for a specific property or column in a table
   o Population completeness: evaluates missing values with respect to a reference population


Completeness of relational data


= The completeness of a table characterizes the extent to which the table represents the real world.
= Can be characterized with respect to:
   o The presence/absence and meaning of null values
      Example: in Person(name, surname, birthdate, email), a null email may indicate that:
      = the person has no email (no incompleteness)
      = the email exists but is not known (incompleteness)
      = it is not known whether the person has an email (incompleteness may not be the case)
   o Validity of the open world assumption (OWA) or closed world assumption (CWA)
      = OWA: assumes that, in addition to missing values, some tuples representing real-world entities may also be missing
      = CWA: assumes the database has collected all the tuples representing real-world entities, but the values of some attributes may be missing


Metrics for quantifying completeness (1) 


= Model without null values with OWA
   o Needs a reference relation ref(r) for a relation r, containing all the tuples that satisfy the schema of r

   C(r) = |r| / |ref(r)|

   Example: according to a registry of Lisbon municipality, the number of citizens is 2 million. If a company stores data about Lisbon citizens for the purpose of its business and that number is 1,400,000, then C(r) = 0.7
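
The ratio can be written as a one-line function. The sketch below only restates the formula and checks it against the Lisbon figures; the function name and the guard against an empty reference relation are choices made for this example.

```python
def population_completeness(r_size: int, ref_size: int) -> float:
    """C(r) = |r| / |ref(r)| for a relation r and its reference relation ref(r)."""
    if ref_size <= 0:
        raise ValueError("the reference relation must contain at least one tuple")
    return r_size / ref_size


# Figures from the Lisbon example: 1,400,000 stored citizens out of 2,000,000
c = population_completeness(1_400_000, 2_000_000)
print(c)
```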


Metrics for quantifying completeness (2)


= Model with null values with CWA: specific definitions for different granularities:
   o Values: to capture the presence of null values for some fields of a tuple
   o Tuple: to characterize the completeness of a tuple wrt the values of all its fields
      = Evaluates the % of specified values in the tuple wrt the total number of attributes of the tuple itself

      Example: Student(stID, name, surname, vote, examdate)
      Equal to 1 for (6754, Mike, Collins, 29, 7/17/2004)
      Equal to 0.8 for (6578, Julliane, Merrals, NULL, 7/17/2004)


Metrics for quantifying completeness (3) 


   o Attribute: to measure the number of null values of a specific attribute in a relation
      = Evaluates the % of specified values in the column corresponding to the attribute wrt the total number of values that should have been specified
      Example: for calculating the average of votes in Student, a notion of the completeness of Vote is useful
   o Relation: to capture the presence of null values in the whole relation
      = Measures how much information is represented in the relation by evaluating the content actually available wrt the maximum possible content, i.e., without null values
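
The tuple, attribute and relation granularities can be illustrated on the Student example, with None standing in for NULL. This is a minimal sketch: the function names are made up for the example, and the relation contains only the two tuples shown in the slides.

```python
# Student(stID, name, surname, vote, examdate); None plays the role of NULL
students = [
    (6754, "Mike", "Collins", 29, "7/17/2004"),
    (6578, "Julliane", "Merrals", None, "7/17/2004"),
]
VOTE = 3  # position of the vote attribute in each tuple


def tuple_completeness(row) -> float:
    """Fraction of specified (non-null) values in one tuple."""
    return sum(v is not None for v in row) / len(row)


def attribute_completeness(rows, index: int) -> float:
    """Fraction of specified values in one column."""
    return sum(row[index] is not None for row in rows) / len(rows)


def relation_completeness(rows) -> float:
    """Fraction of specified values over all cells of the relation."""
    cells = [v for row in rows for v in row]
    return sum(v is not None for v in cells) / len(cells)


print([tuple_completeness(row) for row in students])
print(attribute_completeness(students, VOTE))
print(relation_completeness(students))
```

With these two tuples the tuple scores are 1 and 0.8, as on the slide; the vote column is half complete, and nine of the ten cells of the relation are specified.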


Time-related dimensions


= Currency: concerns how promptly data are updated
   o Example: if the residential address of a person is updated (it corresponds to the address where the person lives), then the currency is high
= Volatility: characterizes the frequency with which data vary in time
   o Example: birth dates (volatility zero) vs. stock quotes (high degree of volatility)
= Timeliness: expresses how current data are for the task in hand
   o Example: the timetable for university courses can be current by containing the most recent data, but it is not timely if it is available only after the start of the classes


Metrics of time-related dimensions 


= Last update metadata for currency
   o Straightforward for data types that change with a fixed frequency
= Length of time that data remain valid for volatility
= Currency + check that data are available before the planned usage time for timeliness
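
A possible reading of these three metrics in code is sketched below. It treats volatility as a validity period, considers data current while their age (derived from the last-update timestamp) stays within that period, and considers them timely when they are also available before the planned usage time. The function names, dates and the 120-day validity period are assumptions of this example.

```python
from datetime import datetime, timedelta


def is_current(last_update: datetime, at_time: datetime, validity: timedelta) -> bool:
    """Data are current if their age at `at_time` does not exceed the validity period."""
    return at_time - last_update <= validity


def is_timely(last_update: datetime, available_at: datetime,
              planned_use: datetime, validity: timedelta) -> bool:
    """Timely = available before the planned usage time and still current then."""
    return available_at <= planned_use and is_current(last_update, planned_use, validity)


# Hypothetical course timetable: updated on Sept 1, classes start on Sept 15
last_update = datetime(2024, 9, 1)
validity = timedelta(days=120)          # assumed validity of a semester timetable
classes_start = datetime(2024, 9, 15)

late = is_timely(last_update, datetime(2024, 9, 20), classes_start, validity)
on_time = is_timely(last_update, datetime(2024, 9, 10), classes_start, validity)
print(late, on_time)
```

The first case mirrors the timetable example: the data are current, yet publishing them after the start of classes makes them untimely.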


Consistency


= Captures the violation of semantic rules defined over a set of data items, where data items can be tuples of relational tables or records in a file
   o Integrity constraints in relational data
      = Domain constraints, key definitions, inclusion and functional dependencies


Other dimensions 


= Interpretability: concerns the documentation and metadata that are available to correctly interpret the meaning and properties of data sources
= Synchronization between different time series: concerns proper integration of data having different time stamps
= Accessibility: measures the ability of the user to access the data given his/her own culture, physical status/functions, and the technologies available


Outline


= Introduction to data cleaning 

= Application contexts of data cleaning 
= Data quality dimensions 

> Taxonomy of data quality problems 
= Data quality process 

= Main data quality tools 

= Real-world examples 


Taxonomy of data quality problems 


[Oliveira 2009] 


= Value-level 

= Value-set (attribute/column) level 
= Record level 

= Relation level 

= Multiple relations level 


Value level 


= Missing value: value not filled in for a not-null attribute
   o Ex: birth date = ''
= Syntax violation: value does not satisfy the syntax rule defined for the attribute
   o Ex: zip code = 27655-175; syntactical rule: xxxx-xxx
= Spelling error
   o Ex: city = 'Lsboa', instead of 'Lisbon'
= Domain violation: value does not belong to the valid domain set
   o Ex: age = 240; age: {0, ..., 120}
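
Three of these value-level problems can be detected with simple per-field checks, as in the sketch below (spelling errors need a dictionary and are left out). The field names, the reading of "x" as a digit in the zip rule, and the inclusive age range are assumptions of this example.

```python
import re

# Syntax rule xxxx-xxx, reading each x as a digit (an assumption)
ZIP_RULE = re.compile(r"\d{4}-\d{3}")
AGE_MIN, AGE_MAX = 0, 120


def value_problems(record: dict) -> list:
    """Return the value-level problems found in one record."""
    problems = []
    if not record.get("birth_date"):
        problems.append("missing value: birth_date")
    zip_code = record.get("zip_code")
    if zip_code is not None and not ZIP_RULE.fullmatch(zip_code):
        problems.append("syntax violation: zip_code")
    age = record.get("age")
    if age is not None and not AGE_MIN <= age <= AGE_MAX:
        problems.append("domain violation: age")
    return problems


dirty = {"birth_date": "", "zip_code": "27655-175", "age": 240}
clean = {"birth_date": "1980-05-17", "zip_code": "2765-175", "age": 44}
print(value_problems(dirty))
print(value_problems(clean))
```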


Value-set and Record levels


Value-set level
   o Existence of synonyms: attribute takes different values, but with the same meaning
      = Ex: emprego = 'futebolista'; emprego = 'jogador futebol' (Portuguese: occupation = 'footballer' vs. 'football player')
   o Existence of homonyms: same word used with different meanings
      = Ex: same name refers to different authors of a publication
   o Uniqueness violation: unique attribute takes the same value more than once
      = Ex: two clients have the same ID number
   o Integrity constraint violation
      = Ex: sum of the values of the percent attribute is more than 100

Record level
   o Integrity constraint violation
      = Ex: total price of a product is different from price plus taxes


Relation level


= Heterogeneous data representations: different ways of representing the same real-world entity
   o Ex: name = 'John Smith'; name = 'Smith, John'
= Functional dependency violation
   o Ex: (2765-175, 'Estoril') and (2765-175, 'Oeiras')
= Existence of approximate duplicates
   o Ex: (1, André Fialho, 12634268) and (2, André Pereira Fialho, 12634268)
= Integrity constraint violation
   o Ex: sum of salaries exceeds the established maximum
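
A functional dependency violation such as the zip code example can be found by grouping tuples on the left-hand side and looking for groups with more than one right-hand-side value. The sketch below assumes the dependency zip -> city; the attribute names and the third row are hypothetical.

```python
from collections import defaultdict


def fd_violations(rows, lhs: str, rhs: str) -> dict:
    """Return {lhs value: set of rhs values} for every lhs value that
    maps to more than one rhs value, i.e. violates lhs -> rhs."""
    seen = defaultdict(set)
    for row in rows:
        seen[row[lhs]].add(row[rhs])
    return {key: values for key, values in seen.items() if len(values) > 1}


rows = [
    {"zip": "2765-175", "city": "Estoril"},
    {"zip": "2765-175", "city": "Oeiras"},
    {"zip": "1000-001", "city": "Lisboa"},  # hypothetical consistent row
]

violations = fd_violations(rows, "zip", "city")
print(violations)
```

Detecting the violation does not say which of the conflicting values is right; deciding that is part of the repair step.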


Multiple tables level 


= Heterogeneous data representations
   o Ex: one table stores meters, another stores inches
= Existence of synonyms
= Existence of homonyms
= Different granularities: same real-world entity represented with different granularity levels
   o Ex: age: {0-30, 31-60, > 60}; age: {0-25, 26-40, 40-65, > 65}
= Referential integrity violation
= Existence of approximate duplicates
= Integrity constraint violation


Outline


= Introduction to data cleaning 

= Application contexts of data cleaning 
= Data quality dimensions 

= Taxonomy of data quality problems 
> Data quality process 

= Main data quality tools 

= Real-world examples 


Data Quality Process 


1. Data Quality Auditing (Assessment)
   o Data Profiling
   o Data Analysis

2. Data Quality Improvement
   o Data Cleaning
   o Data Enrichment


Data quality auditing


= Constituted by:
   o Data profiling: analysing data sources to identify data quality problems
   o Data analysis: statistical evaluation, logical study and application of data mining algorithms to define data patterns and rules
= Main goals:
   o To obtain a definition of the data: metadata collection
   o To check violations of the metadata definition
   o To detect other data quality problems that belong to a given taxonomy
   o To supply recommendations concerning the data cleaning task


Data Profiling 


o Data source discovery
   = Metadata
o Schema discovery
   = Schema matching and mapping
   = Profiling for metadata (keys, foreign keys, data types, ...)
o Data discovery
   = Column-level: null values, domains, patterns, value distributions / histograms
   = Table-level: data mining, rules
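
Column-level profiling can be sketched with a few counters. The example below reports null count, number of distinct values, value histogram and a pattern histogram in which letters are mapped to "A" and digits to "9". The pattern encoding and the sample zip-code column are assumptions of this example.

```python
import re
from collections import Counter


def pattern(value) -> str:
    """Abstract a value into a pattern: letters -> 'A', digits -> '9'."""
    text = re.sub(r"[A-Za-z]", "A", str(value))
    return re.sub(r"\d", "9", text)


def profile_column(values) -> dict:
    """Basic column profile: nulls, distinct values, histogram, patterns."""
    non_null = [v for v in values if v is not None]
    return {
        "nulls": len(values) - len(non_null),
        "distinct": len(set(non_null)),
        "histogram": Counter(non_null),
        "patterns": Counter(pattern(v) for v in non_null),
    }


# Hypothetical zip-code column
zips = ["2765-175", "2780", None, "2765-175", "27655-175"]
profile = profile_column(zips)
print(profile["nulls"], profile["distinct"])
print(profile["patterns"].most_common())
```

Rare patterns, such as "99999-999" here, are good candidates for syntax violations to review.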


Typical techniques used in data quality auditing


= Dictionaries of words: attribute values are compared with one or more dictionaries of the domain
   o Ex: WordNet
= Algorithms to detect functional dependencies and their violations
   o Ex: Localidade => Cod.Postal
   [Figure: table with columns Nome (name), Cod.Postal (postal code), Localidade (locality); one recoverable row: José, 2780, Oeiras]
= Algorithms to detect duplicates
   o String matching for string fields
      = Character-based
      = Token-based
      = Phonetic algorithms
   o Record matching
      = Rule-based
      = Probabilistic


Data quality improvement 


= Often includes:
   o Data transformation: set of operations that source data must undergo to fit the target schema
   o Data cleaning: detecting, removing and correcting dirty data (including approximate duplicate elimination)
   o Data enrichment: use of additional information to improve data quality
= Main goal:
   o To correct the data quality problems detected during the data quality auditing process


Typical techniques used in data cleaning and transformation


= Dictionaries of words 
= Libraries of pre-defined cleaning functions 
= Machine learning techniques 


= Techniques for consolidating approximate duplicates 


Methodology for data cleaning 


1. Extraction of the individual fields that are relevant
2. Standardization of record fields
3. Correction of data quality problems at value level
   o Missing values, syntax violation, etc.
4. Correction of data quality problems at value-set level and record level
   o Synonyms, homonyms, uniqueness violation, integrity constraint violation, etc.
5. Correction of data quality problems at relation level
   o Violation of functional dependencies, duplicate elimination, etc.
6. Correction of data quality problems at multiple relations level
   o Referential integrity violation, duplicate elimination, etc.

= User feedback
   o To solve instances of data quality problems not addressed by automatic methods
= The effectiveness of the data cleaning and transformation process must always be measured on a sample of the data set
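
Step 2 (standardization) is what makes later comparisons meaningful: "Main St" and "Main Street" only match exactly once both are brought to the same form. A minimal sketch follows; the abbreviation table and the choice of lowercase output are hypothetical, and a real table would be much larger and domain-specific.

```python
# Hypothetical abbreviation table; note that "st" can also mean "saint",
# so a real standardizer needs context-aware rules
ABBREVIATIONS = {"st": "street", "blvd": "boulevard"}


def standardize_address(addr: str) -> str:
    """Lowercase, drop periods, collapse whitespace, expand abbreviations."""
    words = addr.lower().replace(".", " ").split()
    return " ".join(ABBREVIATIONS.get(word, word) for word in words)


print(standardize_address("Main St"))
print(standardize_address("Main  Street"))
print(standardize_address("Culver Blvd."))
```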


Data Cleaning Tasks


= Extraction from sources
   o Technical and syntactic obstacles
= Transformation
   o Schematic obstacles
   o Syntactic and semantic obstacles
= Standardization
= Duplicate detection
   o Similarity functions
   o Algorithms
= Data fusion / consolidation
   o Semantic obstacles
= Loading into warehouse / presenting to user


Human Interaction is Needed


o Components to implement
   = Wrappers for technical heterogeneity
   = Schema integration based on correspondences
   = Similarity measure for schema elements
   = Similarity measure for records
o Knobs to turn
   = Thresholds for similarity measures
   = Partition size / window size
o Expert guidance
   = Rule selection / rule specification
   = Schema matching
   = Duplicate detection
   = Data fusion


Outline


= Introduction to data cleaning 

= Application contexts of data cleaning 
= Data quality dimensions 

= Taxonomy of data quality problems 
= Data quality process 

> Main data quality tools 

= Real-world examples 


Existing technology for ensuring data quality


= Ad-hoc programs written in a programming language like C or Java, or using an RDBMS proprietary language
   o Programs difficult to optimize and maintain
= RDBMS mechanisms for guaranteeing integrity constraints
   o Do not address important data instance problems
= Data transformation workflow scripts using a data cleaning/profiling tool


Criteria for comparing commercial data quality tools (1)


Debugger:
   o Data lineage: data lineage or provenance identifies the set of source data items that produced a given data item
   o Breakpoints: a breakpoint is an intentional stopping or pausing place in a cleaning program, put in place for debugging purposes
   o Edit values: the user can edit values during debugging


Criteria for comparing commercial data quality tools (2)


Profiling:
   o Rules: a rule is business logic that defines conditions applied to data. Rules are used to validate the data and to measure data quality
   o Filters: a filter is used to split the data tuples into different groups. Each group should be validated by a different set of rules


Criteria for comparing commercial data quality tools (3)


Execution:
   o User involvement: support for user interaction in a data cleaning process
   o Incremental updates: the ability to incrementally update data targets, instead of rebuilding them from scratch every time


Commercial Data Cleaning Tools (2014) (1/3)


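Tools | Debugger: Data lineage | Debugger: Breakpoints | Debugger: Edit values | Profiling: Rules | Profiling: Filters | Execution: User involvement | Execution: Incremental updates
Informatica PowerCenter | Y | Y | Y | Y | Y | N | [illegible]
IBM Information Server | [values not recoverable]
Talend Open Studio | [values not recoverable]
Oracle Data Integrator | [values not recoverable]
SQL Server Integration Services | [values not recoverable]
SAS Data Integration Studio | [values not recoverable]
Pentaho Data Integration | [values not recoverable]
Clover ETL | [values not recoverable]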


Criteria for comparing commercial data quality tools (4)


Extensibility:
   o Create operators: the user can define new operators
   o Modify operators: the user can modify standard operators

User Interface:
   o Drag and drop: the user can define data quality processes using a drag-and-drop interface
   o Editor: the user can define and edit data quality processes modeled as workflows using a graphical interface


Commercial Data Cleaning Tools (2014) (2/3)


Tools | Extensibility: Create operators | Extensibility: Modify operators | User Interface: Drag and drop | User Interface: Graphical editor
Informatica PowerCenter | Y (Java) | N | Y | Y
IBM Information Server | Y (Java) | N | Y | Y
Talend Open Studio | Y (Java, Groovy) | Y (Java) | Y | Y
Oracle Data Integrator | Y | Y | Y | Y
SQL Server Integration Services | Y (C#, VB) | N | [illegible] | Y
SAS Data Integration Studio | Y (SAS) | Y (SAS) | [illegible] | [illegible]
Pentaho Data Integration | Y (Javascript) | N | Y | Y
Clover ETL | Y (CTL) | N | Y | Y


Criteria for comparing commercial data quality tools (5)


Scalability:
   o Grid: the tool can run a cleaning process on a collection of computer resources from multiple locations
   o Partitioning: the user can partition the data and run each partition independently (on different CPUs or cores)
   o Pushdown optimization: the tool translates the transformation logic into SQL queries and sends them to the database. The database engine executes the SQL queries to process the transformations

Others:
   o Free version: the tool has a free version


Commercial Data Cleaning Tools (2014) (3/3)


Tools | Scalability: Grid | Scalability: Partitioning | Scalability: Pushdown optimization | Others: Free version
Informatica PowerCenter | Y | Y | Y | Y
IBM Information Server | Y | Y | Y | N
Talend Open Studio | Y | N | Optional ELT | Y
Oracle Data Integrator | Y | N | ELT | Y
SQL Server Integration Services | N | Y | Y (IST) [one cell missing in the source row]
SAS Data Integration Studio | Y | Y | Y | Y (IST)
Pentaho Data Integration | Y | [illegible] | N | Y
Clover ETL | Y | Y | N | Y


Research Data cleaning tools (2014) (1/2)


Tools | Detection: Constraints | Detection: Statistical | Repair: Search | Repair: ML/St | Repair: Data transformations
Cleenex | QCs | N | N | N | Y
Llunatic | EGDs | N | Y | N | N
Nadeef | CFDs, MDs | N | Y | N | N
Guided data repair | CFDs | N | Y | [illegible] | N
Scare | N | Y | N | [illegible] | N
Eracer | N | [illegible] | N | Y | N
Continuous data cleaning | FDs | N | Y | [illegible] | N


Criteria for comparing research data cleaning tools (1)


Detection:
   o Constraints: use of rules and/or conditions
      = EGDs: equality generating dependencies
      = QCs: quality constraints
      = CFDs: conditional functional dependencies
      = MDs: matching dependencies
   o Statistical: dirty tuples are detected based on simple statistics or on complex data analysis


Criteria for comparing research data cleaning tools (2)


Repair:
   o Search: the system explores the space of possible clean tables and heuristically selects the best table
   o ML/St: the system uses machine learning and/or statistical models to infer data values or to prune the search
   o Data transformations: the system models the data cleaning process as a data transformation graph


Criteria for comparing research data cleaning tools (3)


User Interface:
   o Graphical interface: the system provides a visualization tool and menus to interact with
   o User edition: the system allows the user to edit data values

Others:
   o Scalability: the system execution time grows linearly with the number of input tuples
   o Streaming: the system receives tuples and processes each of them individually (as opposed to batch processing)
   o Extensible: the system allows the user to modify and/or insert new algorithms


Research Data cleaning tools (2014) (2/2)


Tools | User Interface: Graphical interface | User Interface: User edition | Others: Extensible | Others: Streaming | Others: Scalability
Cleenex | Y | Y | Matching algorithms | N | N
Llunatic | Y | [illegible] | Cost managers | N | Y
Nadeef | Y | N | Repair algorithms | N | N
Guided data repair | N | Y | N | N | N
Scare | N | N | N | N | Y
Eracer | N | N | N | N | N
Continuous data cleaning | N | N | N | Y | Y


Outline


= Introduction to data cleaning 

= Application contexts of data cleaning 
= Data quality dimensions 

= Taxonomy of data quality problems 
= Data quality process 

= Main data quality tools
> Real-world examples


Death by Typo


'Resurrected,' but still wallowing in red tape
Government records incorrectly kill off thousands, and there's no easy fix

By Alex Johnson and Nancy Amons, Reporters, MSNBC and NBC News
Updated 6:21 p.m. ET Feb. 29, 2008

For a dead woman, Laura Todd is awfully articulate.

"I don't think people realize how difficult it is to be dead when you're not," said Todd, who is very much alive and kicking in Nashville, Tenn., even though the federal government has said otherwise for many years.

Todd's struggle started eight years ago with a typo in government records. The government has reassured her numerous times that it has cleared up the confusion, but the problems keep coming.

[Video caption: Does this woman look dead to you? The government says Toni Anderson is dead, but she insists she is very much alive. David MacAnally of NBC affiliate WTHR reports from Muncie, Ind. (NBC News Channel)]


Google searches for Britney Spears


488941 britney spears 
40134 brittany spears 
36315 brittney spears 
24342 britany spears 
7331 britny spears 
6633 briteny spears 
2696 britteny spears 
1807 briney spears 
1635 brittny spears 
1479 brintey spears 
1479 britanny spears 
1338 britiny spears 
1211 britnet spears 
1096 britiney spears 

991 britaney spears 
991 britnay spears 
811 brithney spears 
811 brtiney spears 
664 birtney spears 
664 brintney spears 
664 briteney spears 
601 bitney spears 
601 brinty spears 
544 brittaney spears 
544 brittnay spears 
364 britey spears 
364 brittiny spears 
329 brtney spears 
269 bretney spears 
269 britneys spears 
244 britne spears 
244 brytney spears 
220 breatney spears 
220 britiany spears 


29 britent spears 
29 brittnany spears 
29 britttany spears 
29 btiney spears 
26 birttney spears 
26 breitney spears 
26 brinity spears 
26 britenay spears 
26 britneyt spears 
26 brittan spears 
26 brittne spears 
26 btittany spears 
24 beitney spears 
24 birteny spears 
24 brightney spears 
24 brintiny spears 
24 britanty spears 
24 britenny spears 
24 britini spears 
24 britnwy spears 
24 brittni spears 
24 brittnie spears 
21 biritney spears 
21 birtany spears 
21 biteny spears 
21 bratney spears 
21 britani spears 
21 britanie spears 
21 briteany spears 
21 brittay spears 
21 brittinay spears 
21 brtany spears 
21 brtiany spears 
19 bimey spears 


147 britty spears
19 britony spears
19 brittanty spears


9 brinttany spears 
9 britanay spears 
9 britinany spears 
9 britn spears 
9 britnew spears 
9 britneyn spears 
9 britney spears 
9 brtiny spears 
9 brtittney spears 
9 brtny spears 
9 brytny spears 
9 rbitney spears 
8 birtiny spears 
8 bithney spears 
8 brattany spears 
8 breitny spears 
8 breteny spears 
8 brightny spears 
8 brintay spears 
8 brinttey spears 
8 briotney spears 
8 britanys spears 
8 britley spears 
8 britneyb spears 
8 britnrey spears 
8 britnty spears 
8 brittner spears 
8 brottany spears 
7 baritney spears 
7 birntey spears 
7 biteney spears 
7 bitiny spears 
7 breateny spears 
7 brianty spears 
7 brintye spears 
7 britianny spears 
7 britly spears 
7 britnej spears
7 britneyu spears


5 brney spears 

5 broitney spears 
5 brotny spears 

5 bruteny spears 
5 btiyney spears 
5 btrittney spears 
5 gritney spears 
5 spritney spears 
4 bittny spears 

4 bnritney spears 
4 brandy spears 
4 brbritney spears 
4 breatiny spears 
4 breetney spears 
4 bretiney spears 
4 brfitney spears 
4 briattany spears 
4 brieteny spears 
4 briety spears 

4 briitny spears 

4 briittany spears 
4 brinie spears 

4 brinteney spears 
4 brintne spears 

4 britaby spears 

4 britaey spears 

4 britainey spears 
4 britinie spears 

4 britinney spears 
4 britmney spears 
4 britnear spears 
4 britnel spears 

4 britneuy spears 
4 britnewy spears 
4 britnmey spears 
4 brittaby spears 
4 brittery spears
4 britthey spears
4 brittnaey spears


3 britiy spears 

3 britmeny spears 
3 britneeey spears 
3 britnehy spears 
3 britnely spears 
3 britnesy spears 
3 britnetty spears 
3 britnex spears 

3 britneyxxx spears 
3 britnity spears 

3 britntey spears 
3 britnyey spears 
3 britterny spears 
3 brittneey spears 
3 brittnney spears 
3 brittnyey spears 
3 brityen spears 
3 briytney spears 
3 britney spears 

3 broteny spears 
3 brtaney spears 
3 brtiiany spears 
3 brtinay spears 

3 brtinney spears 
3 brtitany spears 
3 brtiteny spears 
3 brtnet spears 

3 brytiny spears 

3 btney spears 

3 drittney spears 
3 pretney spears 
3 rbritney spears 
2 barittany spears 
2 bbbritney spears 
2 bbitney spears 
2 bbritny spears 


Source: http://www.google.com/jobs/britney.html


2 brirreny spears 
2 brirtany spears 
2 brirttany spears 
2 brirttney spears 
2 britain spears 

2 britane spears 

2 britaneny spears 
2 britania spears 
2 britann spears 

2 britanna spears 
2 britannie spears 
2 britannt spears 
2 britannu spears 
2 britanyl spears 
2 britanyt spears 
2 briteeny spears 
2 britenany spears 
2 britenet spears 
2 briteniy spears 

2 britenys spears 
2 britianey spears 
2 britin spears 

2 britinary spears 
2 britmy spears 

2 britnaney spears 
2 britnat spears 

2 britnbey spears 
2 britndy spears 

2 britneh spears 

2 britneney spears 
2 britney6 spears 
2 britneye spears 
2 britneyh spears 
2 britneym spears 
2 britneyyy spears 
2 britnhey spears


[Image: envelope marked "POSTAGE PAID GB", addressed to "Dr Felix Maumann"]


FIFA registration form (2010)

[Screenshot: registration form with the fields Nationality, Country of Residence, Mother Tongue, Preferred FIFA Language, Secondary FIFA Language, Organisation Name, Organisation Role (Prof), Notes (Max 2000 chars); the value "German Democratic Republic" appears in the form]


German Umlaute

[Screenshot: DBLP search results for 'dessloch'. DBLP: [Home | Search: Author, Title | Conferences | Journals], Michael Ley (ley@uni-trier.de), Thu Jan 31 10:44:06 2008]


Next lecture


= Data Matching 

